An accessible infrastructure for artificial intelligence using a Docker-based JupyterLab in Galaxy

Abstract

Background: Artificial intelligence (AI) programs that train on large datasets require powerful compute infrastructure consisting of several CPU cores and GPUs. JupyterLab provides an excellent framework for developing AI programs, but it needs to be hosted on such an infrastructure to enable faster training of AI programs using parallel computing.

Findings: An open-source, Docker-based, and GPU-enabled JupyterLab infrastructure is developed that runs on the public compute infrastructure of Galaxy Europe, consisting of thousands of CPU cores, many GPUs, and several petabytes of storage, to rapidly prototype and develop end-to-end AI projects. Using a JupyterLab notebook, long-running AI model training programs can also be executed remotely to create trained models, represented in the open neural network exchange (ONNX) format, and other output datasets in Galaxy. Other features include Git integration for version control, the option of creating and executing pipelines of notebooks, and multiple dashboards and packages for monitoring compute resources and visualization, respectively.

Conclusions: These features make JupyterLab in Galaxy Europe highly suitable for creating and managing AI projects. A recent scientific publication that predicts infected regions in COVID-19 computed tomography scan images is reproduced using various features of JupyterLab on Galaxy Europe. In addition, ColabFold, a faster implementation of AlphaFold2, is accessed in JupyterLab to predict the 3-dimensional structure of protein sequences. JupyterLab is accessible in two ways: as an interactive Galaxy tool and by running the underlying Docker container. In both cases, long-running training can be executed on Galaxy's compute infrastructure. Scripts to create the Docker container are available under the MIT license at https://github.com/usegalaxy-eu/gpu-jupyterlab-docker.

Background

Bioinformatics comprises many sub-fields, such as single-cell analysis, medical imaging, sequencing and proteomics, that produce huge amounts of biological data in myriad formats. For example, the single-cell field creates gene expression patterns for each cell that are represented as matrices of real numbers. The medical imaging field generates images of cells and tissues, and radiography images such as chest x-rays and computerized tomography (CT) scans. Next-generation sequencing generates deoxyribonucleic acid (DNA) sequences that are stored as FASTA [1] files. Artificial intelligence (AI) approaches such as machine learning (ML) and deep learning (DL) have been widely applied to these datasets [2] for predictive tasks such as medical diagnosis, imputing missing features, augmenting datasets with artificially generated ones, and estimating gene expression patterns. Using ML and DL algorithms on such datasets requires a robust and efficient compute infrastructure that can serve multiple purposes: pre-processing raw datasets into formats compatible with ML and DL algorithms, creating and executing complex model architectures on the pre-processed datasets, and making trained models and predicted datasets readily available for further analyses. To facilitate such tasks, a complete infrastructure is developed that combines a JupyterLab [3] notebook, augmented with many useful features, with the public compute resources of Galaxy [4] Europe to perform end-to-end AI analyses on biological datasets. The infrastructure consists of three important components: first, a Docker container [5] that encapsulates JupyterLab together with a large number of packages used for developing AI programs, data manipulation and visualisation (Section S2 in the supplementary file lists all the packages with their respective versions); second, a Galaxy interactive tool [6,7] that downloads this Docker container to serve JupyterLab online on Galaxy Europe; third, the compute infrastructure consisting of several CPUs and GPUs that acts as the backend on which the online JupyterLab runs.

Docker container
Docker [8] containers are popular for shipping packaged software as complete ecosystems, enabling it to run reproducibly in a platform-independent manner. Software executing inside a Docker container is abstracted from the operating system (OS), as most of the requirements it needs to run successfully are already configured inside the container. A container runs as an isolated environment that makes a minimal number of interactions with the host OS, thereby improving the security of the running software. Such security benefits are essential for online program editors that execute arbitrary code. In addition to minimising security risks, Docker containers provide performance benefits compared to running programs on a virtual machine [9]. Motivated by these benefits, a Docker container is used in this project to encapsulate JupyterLab along with many useful packages such as Git [10], Elyra AI [11], TensorFlow-GPU [12], scikit-learn [13], CUDA [14], ONNX [15], and many others. The Docker container inherits packages such as NumPy [16], SciPy [17] and a few more from its base container, jupyter/tensorflow-notebook [18], and augments them with many other packages suitable for ML and DL, data manipulation and visualisation. The Docker container is decoupled from Galaxy and can be executed independently to serve JupyterLab with the same set of packages on a different compute infrastructure or on any personal computer (PC) or laptop with approximately 25 gigabytes (GB) of free disk space. Moreover, it can easily be extended with additional packages simply by adding their names to its Dockerfile [19]. The approach for extending the Docker container is discussed in detail in the Methods section.

JupyterLab
JupyterLab is a robust, web-based editor used for varied purposes such as data science, scientific computing, machine learning and deep learning. It supports more than 40 programming languages, among the most popular being Python, R, Julia and Scala. Python is one of the most popular languages used by researchers for performing scientific and predictive analyses. It is therefore used as the programming language in Galaxy JupyterLab notebooks, because many popular packages are readily available for Python: TensorFlow and scikit-learn for ML and DL, pandas for data manipulation, and seaborn [20], matplotlib [21], bokeh [22] and many others for visualisation. Moreover, the extensible architecture of JupyterLab makes it possible to add many external packages, such as Git, Elyra AI and dashboards, as JupyterLab plugins with their own user interface (UI) components. Editors such as JupyterLab, integrated with several useful packages, provide a favourable platform for both rapid prototyping and end-to-end development and management of AI projects. To harness these benefits, JupyterLab has been used as the editor for the JupyterLab interactive tool in Galaxy.

Features of the JupyterLab notebook infrastructure
Features such as easy accessibility, support for a wide variety of programming languages, and extensibility through plugins make JupyterLab a desirable editor for researchers to create project prototypes rapidly. Many such features have been integrated into a JupyterLab notebook infrastructure that serves JupyterLab notebooks online on Galaxy Europe, running on large compute resources and enabling researchers to create prototypes and end-to-end AI projects (Figure 1). A few important features are discussed here. To allow GPU computation from JupyterLab notebooks, TensorFlow-GPU interacts with NVIDIA GPU hardware through CUDA whenever the compute resource has GPUs, accelerating deep learning programs. Faster execution of deep learning programs is one of the significant features of JupyterLab hosted on Galaxy Europe; however, if the host machine on which the Docker container runs does not have GPUs, the program in a JupyterLab notebook falls back to CPUs. Other useful features include ONNX for transforming trained TensorFlow and scikit-learn models into ONNX models; OpenCV [23] and scikit-image [24] for processing images; nibabel [25] for reading image files stored as ".nii"; BioBlend [26] for accessing Galaxy's datasets, histories and workflows from a JupyterLab notebook; visualisation packages such as bqplot [27] and bokeh for plotting interactive charts; voila [28] for displaying the output cells of a JupyterLab notebook; and dashboards such as nvdashboard [29] for monitoring GPU usage and performance. Support for file formats such as H5 [30], which is efficient for storing matrices, enables machine learning researchers to save model weights and input datasets for AI algorithms. Other packages such as ColabFold [31], together with JAX [32], are used for predicting 3D structures of proteins, which is discussed in detail in the Results section. In addition, it is possible to create a long-running training job that runs remotely and stores trained models and output datasets permanently in a newly created Galaxy history. The trained model is saved as an ONNX file and tabular datasets as H5 files.
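As an illustration of the ONNX export feature, the short sketch below converts a trained scikit-learn classifier to an ONNX model and writes it to disk using the skl2onnx package mentioned above. The dataset, feature count and output file name are placeholders chosen for this example rather than values taken from the Galaxy notebooks.

    # Minimal sketch: export a trained scikit-learn model to ONNX with skl2onnx.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    # Toy data standing in for a pre-processed biological dataset (placeholder).
    X = np.random.rand(100, 8).astype(np.float32)
    y = (X.sum(axis=1) > 4).astype(int)

    model = RandomForestClassifier(n_estimators=10).fit(X, y)

    # Declare the input signature: a float tensor with 8 features per sample.
    initial_types = [("input", FloatTensorType([None, X.shape[1]]))]
    onnx_model = convert_sklearn(model, initial_types=initial_types)

    # Save the serialised ONNX model; it can later be stored in a Galaxy history.
    with open("model.onnx", "wb") as handle:
        handle.write(onnx_model.SerializeToString())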

Related infrastructure
A few other infrastructures, free and commercial, offer JupyterLab or similar environments for developing data science and AI projects. Popular ones include Google Colab [33], Kaggle kernels [34] and Amazon SageMaker [35]. Google Colab is partially free and offers an online editor similar to a JupyterLab notebook. The free version of Colab offers dynamic compute resources: the disk space is around 70 GB and the memory (RAM) is around 12 GB. These resources are scarce for AI projects that deal with high-dimensional biological data [36,37]. In addition, the resources offered by Colab are variable and depend on a user's past usage; more compute resources are assigned to users who have used less in the past, for a more equitable sharing of resources. Moreover, there is a limit of only 12 hours of running time, which may prove insufficient and non-ideal for training AI models on large datasets. Colab Pro and Pro+ offer better compute resources but come at a price, EUR 9.25 and EUR 42.25 per month, respectively. Kaggle kernels are free of charge but, similar to Colab, their compute resources are scarce. The total disk space is approximately 73 GB and the RAM is 16 GB for a CPU-based kernel. For the GPU-based kernel, the disk space is the same as for the CPU-based kernel, but the CPU RAM decreases to 13 GB while an additional 15 GB of RAM is added through the GPU, and computation time is limited to 30 hours a week. Kaggle also supports TPUs, but their computation time is further limited to only 20 hours a week. Amazon SageMaker is commercial software for developing AI algorithms that is free of charge only for the first 2 months. Overall, these notebook infrastructures do not offer unrestricted compute resources free of charge, and the resources they do offer free of charge are insufficient for training AI models on high-dimensional biological datasets. To address these drawbacks and provide researchers and users large compute resources more reliably, the Galaxy JupyterLab notebook infrastructure offers 1 terabyte (TB) of disk space and unlimited computation time on 1 GPU and 7 CPUs per session; the GPU RAM is around 15 GB and the CPU RAM is 20 GB (Table 1). The resources offered for the JupyterLab notebook running in Galaxy stay constant and are independent of a user's past usage. In addition, JupyterLab opens a tab for each notebook, allowing researchers to develop and execute several notebooks inside the same session of the allotted compute resource rather than connecting to a different session for each notebook, as in Google Colab and Kaggle kernels.

Implementation
The JupyterLab infrastructure has been developed in two stages. First, a Docker container is created containing all the necessary packages, such as JupyterLab itself, CUDA, TensorFlow, scikit-learn, ONNX and many more. The Docker container inherits from a base container that is well suited for serving a JupyterLab environment; such containers are collectively known as Jupyter Docker Stacks [38]. In addition to software packages such as NumPy, SciPy and TensorFlow that are already shipped with the base container, many packages are added with compatible versions. Compatible versions of CUDA, cuDNN and TensorFlow are necessary so that, together, they can use the GPU on the host machine to accelerate deep learning programs. Other significant packages integrated into the Docker container are ONNX, scikit-learn, Elyra AI, BioBlend, nibabel, scikit-image, OpenCV, bqplot and voila. The resulting container contains all the necessary packages for developing data science, ML and DL projects. Second, the container can be downloaded to any powerful compute infrastructure, and JupyterLab can then be served in any internet browser via the URL that it generates. To run this container in Galaxy, an interactive tool is created that downloads the container onto a remote compute infrastructure and generates a URL used to open JupyterLab in a browser. The architecture of the JupyterLab infrastructure in Galaxy is shown in Figure 1. The running instance of JupyterLab in Galaxy contains a home page, a JupyterLab notebook, that summarises several of its features. Further notebooks are available, each describing a feature of JupyterLab with code examples, such as how to create ONNX models for scikit-learn and TensorFlow classifiers, how to connect to Galaxy using BioBlend, how to create interactive plots using bqplot, and how to create a pipeline of notebooks using Elyra AI. To help users access the JupyterLab notebook in Galaxy Europe, a ready-to-use hands-on Galaxy Training Network (GTN) [39] tutorial [40] is developed that shows steps such as opening the notebook, using Git to clone a code repository from GitHub, and sending long-running training jobs to a remote Galaxy cluster. The approach of remote model training is explained in the Methods section. The two use cases explained in the Results section are also discussed in the tutorial, along with their respective JupyterLab notebooks.
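The example notebooks cover interactive plotting with bqplot. The following minimal sketch, written for this article rather than taken from those notebooks, shows the general pattern of plotting an interactive line chart inside a JupyterLab cell; the data values are placeholders.

    # Minimal sketch: an interactive line chart with bqplot inside a notebook cell.
    import numpy as np
    from bqplot import pyplot as plt

    # Toy data standing in for a training-loss curve (placeholder values).
    epochs = np.arange(1, 11)
    loss = np.exp(-0.3 * epochs)

    plt.figure(title="Training loss per epoch")
    plt.plot(epochs, loss)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.show()  # renders an interactive figure with pan/zoom in JupyterLab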

Results
The JupyterLab notebook infrastructure in Galaxy Europe is used to reproduce the results of two recent scientific publications, demonstrating its robustness and usefulness for developing deep learning models on COVID-19 CT scan images [41] and for predicting the 3D structure of proteins using ColabFold, a faster implementation of AlphaFold2 [42].

COVID-19 CT scan image segmentation
In [41], COVID-19 CT scan images have been used to develop and train a deep learning model that predicts COVID-19 infected regions in those images with high accuracy. An open-source implementation of the work is available that trains a U-Net deep learning architecture [43] to distinguish between normal and infected regions in CT scan images. Scripts of this implementation are adapted and executed on Galaxy's JupyterLab notebook infrastructure. The adaptation only involves transforming all CT scan images used in [41] into an H5 file so that they can be used directly as input to the U-Net architecture defined in a JupyterLab notebook in [44]. A composite H5 file [45] is created using a script [46]; it contains multiple datasets, each a real-valued matrix corresponding to the training, test and validation sets used in [41]. The entire analysis of [41] can be reproduced using multiple notebooks in [44], which achieve similar precision and recall (approximately 0.98) to those reported in [41]. In [44], the first notebook (1_fetch_datasets.ipynb) downloads the input dataset as an H5 file and additionally downloads the trained ONNX model. The second notebook (2_create_model_and_train.ipynb) creates and trains a U-Net model on the training dataset extracted from the H5 file; training, accelerated by a GPU, finishes in a few minutes for 10 iterations over the entire training dataset. The third notebook (3_predict_masks.ipynb) extracts the test dataset and predicts infected regions of its CT scan images using the trained model created by the second notebook. Figure 2 compares the ground-truth infected regions (second column) with the predicted infected regions (third column); a few original CT scan images from the test dataset are shown in the first column.
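The composite H5 file mentioned above bundles the training, test and validation matrices into a single file. The sketch below, which uses made-up array shapes and dataset names rather than the exact layout of the published script [46], illustrates how such a file can be written and read back with h5py.

    # Minimal sketch: bundle training/validation/test matrices into one H5 file.
    import h5py
    import numpy as np

    # Placeholder arrays standing in for pre-processed CT scan images and masks.
    splits = {
        "train_images": np.random.rand(100, 128, 128, 1).astype(np.float32),
        "train_masks": np.random.randint(0, 2, (100, 128, 128, 1)).astype(np.float32),
        "val_images": np.random.rand(20, 128, 128, 1).astype(np.float32),
        "val_masks": np.random.randint(0, 2, (20, 128, 128, 1)).astype(np.float32),
        "test_images": np.random.rand(20, 128, 128, 1).astype(np.float32),
    }

    # Write every split as a separate dataset inside a single composite file.
    with h5py.File("ct_scans.h5", "w") as handle:
        for name, array in splits.items():
            handle.create_dataset(name, data=array, compression="gzip")

    # Read a split back as a NumPy array, ready to be fed to the U-Net model.
    with h5py.File("ct_scans.h5", "r") as handle:
        train_images = handle["train_images"][:]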

Predict 3D structure of proteins using ColabFold
AlphaFold2 has made a breakthrough in predicting the 3D structure of proteins with outstanding accuracy. However, due to its large database size (a few TB), it is not easily accessible to researchers. Therefore, a few approaches have been developed that replace the time-consuming steps of AlphaFold2 with slightly different steps, predicting the 3D structure of proteins with similar accuracy while consuming less memory and time.
One such approach is ColabFold, which replaces the large database search that AlphaFold2 uses to find homologous sequences with a significantly (40-60 times) faster MMseqs2 API [47] call that generates input features from the query protein sequence. ColabFold's prediction of 3D structures in batches is approximately 90 times faster. It is integrated into the Docker container [5] by adding two packages: ColabFold and GPU-enabled JAX, a just-in-time compiler for mathematical transformations. The "7_ColabFold_MMseq2.ipynb" notebook in [44] predicts the 3D structure of a protein sequence using ColabFold, making use of the AlphaFold2 pre-trained weights. Figure 3 shows the 3D structure of 4-Oxalocrotonate Tautomerase [48], a protein sequence of length 62, along with its side chains. This 3D structure is extremely similar to the structure predicted by the Jupyter notebook [49] from ColabFold in [31].
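Predicted structures are written as PDB files, which can be inspected directly inside the notebook. The sketch below shows one way to render such a file with py3Dmol; the py3Dmol package and the output file name are assumptions made for this example and are not prescribed by the ColabFold notebook.

    # Minimal sketch: render a predicted PDB structure inside a notebook cell.
    # Assumes the py3Dmol package is available and that ColabFold has written a
    # PDB file; "predicted_structure.pdb" is a placeholder file name.
    import py3Dmol

    with open("predicted_structure.pdb") as handle:
        pdb_text = handle.read()

    view = py3Dmol.view(width=600, height=450)
    view.addModel(pdb_text, "pdb")
    view.setStyle({"cartoon": {"color": "spectrum"}})  # colour along the chain
    view.zoomTo()
    view.show()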

Remote model training
For large datasets, model training may need several hours or even days. In such cases, it is impractical to keep the JupyterLab notebook open in a browser tab until training finishes. Therefore, another Galaxy tool [50] is developed to enable researchers to send long-running training jobs to a remote Galaxy cluster. The tool can be executed from a JupyterLab notebook using a custom Python function [51], part of each JupyterLab notebook, that takes the input datasets and a training script as parameters. The input datasets used for training, testing and validation must be provided in H5 format, which standardises the input data format for AI models that train on matrices in JupyterLab notebooks. Input data for an AI model can come in multiple formats, such as images, genomic sequences or gene expression patterns; H5 files can be created from any of these formats and fed to the AI model in JupyterLab notebooks. Long-running training happens on a remote Galaxy cluster as a regular Galaxy job. Upon completion of the job, the resulting datasets and the trained model become available in a newly created Galaxy history [52], from which they can be downloaded for further analysis. In [44], a few notebooks showcase the approach of remote model training. The notebook "4_create_model_and_train_remote.ipynb" contains code for developing and training a U-Net architecture. The notebook "5_run_remote_training.ipynb" executes the previous notebook remotely on a cluster after creating a Galaxy history and uploading the script extracted from "4_create_model_and_train_remote.ipynb" along with the input datasets. The custom Python function, "run_script_job", creates a Galaxy history using BioBlend and then uploads the datasets to that history. After the upload finishes, the Python script from the specified notebook is executed dynamically; it trains a deep learning model on the uploaded datasets and saves the resulting model as an ONNX file in the Galaxy history. Using the "6_predict_masks_remote_model.ipynb" notebook from [44], the trained model can be downloaded from the Galaxy history and used to predict infected regions in the CT scan images of the test dataset. A significant advantage of training deep learning models remotely is that researchers do not have to keep the JupyterLab notebook session running while the model trains, because model training becomes decoupled from the JupyterLab notebook. With this feature, deep learning models that take several hours or even days to train can be trained conveniently.
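The history creation and dataset upload performed by "run_script_job" rely on BioBlend. The fragment below is a simplified sketch of that part of the workflow using the public BioBlend API; the Galaxy URL, the API key placeholder, the history name and the file names are illustrative, and the actual function additionally submits the training script as a Galaxy tool job.

    # Minimal sketch: create a Galaxy history and upload H5 datasets with BioBlend.
    from bioblend.galaxy import GalaxyInstance

    # Connect to Galaxy Europe; the API key is a placeholder for the user's own key.
    gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_GALAXY_API_KEY")

    # Create a fresh history to collect the training inputs and, later, the outputs.
    history = gi.histories.create_history(name="remote-unet-training")

    # Upload the H5 input datasets (placeholder file names) into that history.
    for path in ["train.h5", "validation.h5", "test.h5"]:
        gi.tools.upload_file(path, history["id"])

    print(f"Datasets uploaded to history {history['id']}")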

Extend the Docker container
The customised Docker container shown in Figure 1 can easily be extended with more or different packages. To update the container, the new package or packages are added to the Dockerfile and a new container is built. Once this new container is pushed to Docker Hub [5], the Galaxy interactive tool downloads the updated container the next time it is accessed, and all newly added packages become available. In the same way, versions of existing packages can be updated, or packages removed if they are no longer needed. This simple extension procedure keeps the maintenance cost of the entire infrastructure low, as any change is reflected only in the container without updating Galaxy's codebase.

Collaborative notebooks
Notebooks created in Galaxy's JupyterLab infrastructure can be shared instantly with other researchers and collaborators simply by sharing the public URL of a notebook. Researchers and users who share a notebook can collaborate on the same notebook without having to store it anywhere else, as it is served directly from Galaxy Europe.

Workflow of notebooks
Using the Elyra AI package, a workflow of notebooks can be created and executed as one unit of software, similar to the way Galaxy workflows are created from several tools. Such workflows of JupyterLab notebooks can be executed on the same compute resource on which the JupyterLab notebook runs. In addition, services such as Kubeflow [53] or Apache Airflow [54] can be used to deploy, run and manage such workflows in the cloud.

Summary
The JupyterLab notebook is integrated as an interactive tool in Galaxy Europe, running on a public and powerful compute infrastructure comprising several CPUs and GPUs. It is configured in a Docker container along with many packages such as CUDA, TensorFlow-GPU and scikit-learn to provide a robust architecture for the development and management of ML, DL and data science projects. Remote model training makes it convenient to run multiple analyses in parallel as different Galaxy jobs by executing the same Galaxy tool, with the results becoming available in different histories. Features such as Git integration are useful for managing entire code repositories on GitHub, and Elyra AI for creating pipelines of notebooks that work as one unit of software. All notebooks created by a user run in the same JupyterLab session in different tabs. The entire JupyterLab infrastructure is readily accessible through Galaxy Europe. In contrast to commercial infrastructures that host editors similar to JupyterLab and offer powerful and reliable compute only through paid subscriptions, this infrastructure provides large compute resources that do not vary with usage, has no usage time limit, and guarantees the same set of compute resources across sessions.

Competing Interests
The authors declare that they have no competing interests.

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable

Authors' contributions
A.K. developed the project and wrote the manuscript. G.C. deployed the project on Galaxy Europe. B.G. devised the idea of the project. R.B. provided the necessary support for the entire project. All authors contributed to and approved the manuscript.

Availability of supporting source code and requirements